Genetic diversity and population structure of 93 rice cultivars (lines) (Oryza sativa Xian group) in Qinba in China by 3 types of genetic markers

Background The Qinba region is the transition region between Indica and Japonica varieties in China. It has a long history of Indica rice planting of more than 7000 years and is also a planting area for fine-quality Indica rice. The aims of this study are to explore different genetic markers applied to the analysis population structure, genetic diversity, selection and optimization of molecular markers of Indica rice, thus providing more information for the protection and utilization on germplasm resources of Indica rice. Methods Fifteen phenotypic traits, a core set of 48 SSR markers which originated protocol for identification of rice varieties-SSR marker method in agricultural industry standard of the People's Republic of China (Ministry of Agriculture of the PRC, NY/T1433-2014, Protocol for identification of rice varieties-SSR marker method, 2014), and SNPs data obtained by genotyping-by-sequencing (GBS, NlaIII and MseI digestion, referred to as SNPs-NlaIII and SNPs-MseI, respectively) for this panel of 93 samples using the Illumina HiSeq2000 sequencing platform, were employed to explore the genetic diversity and population structure of 93 samples. Results The average of coefficient of variation (CV) and diversity index (He) were 29.72% and 1.83 ranging from 3.07% to 137.43%, and from 1.45 to 2.03, respectively. The correlation coefficient between 15 phenotypic traits ranged from 0.984 to -0.604. The first four PCs accounted for 70.693% phenotypic variation based on phenotypic analysis. A total of 379 alleles were obtained using SSR markers, encompassing an average of 8.0 alleles per primer. Polymorphic bands (PPB) and polymorphism information content (PIC) was 88.65% and 0.77, respectively. The Mantel test showed that the correlation between the genetic distance matrix based on SNPs-NlaIII and SNPs-MseI was the largest (R2=0.88), and that based on 15 phenotypic traits and SSR was the smallest (R2=0.09). The 93 samples could be clustered into two subgroups by 3 types of genetic markers. Molecular variance analysis revealed that the genetic variation was 2% among populations and 98% within populations (the Nm was 0.16), Tajima’s D value was 1.66, the FST between the two populations was 0.61 based on 72,824 SNPs. Conclusions The population genetic variation explained by SNPs was larger than that explained by SSRs. The gene flow of 93 samples used in this study was larger than that of naturally self-pollinated crops, which may be caused by long-term breeding selection of Indica rice in the Qinba region. The genetic structure of the 93 samples was simple and lacked rare alleles.

It was found that there was a significant correlation between the genetic distance obtained by the two types of SNPs markers, while the lowest correlation occured between the genetic distance of phenotypic traits and SSR data. The population genetic variation explained by SNPs was larger than that explained by SSRs among DNA molecular markers. Bayesian clustering algorithm was superior to the other two clustering methods. The genetic structure of 93 samples representative of the diversity present in Qinba area in China of Oryza sativa Xian group was simple and lacked rare alleles.

Background
According to the origin and the history of rice cultivate in China, the two major types of Oryza sativa L. are classified as O. sativa L. subsp. hsien Ting and O. sativa L. subsp. keng Ting proposed by Ding Y [1,2], and the naming method of O. sativa L. subsp. indica Kato (also known as O. sativa Xian group) and O. sativa L. subsp. japonica Kato (also known as O. sativa Xian group ) for O. sativa L. proposed by Kato was used internationally at the present [3], in which Indica rice is distributed mainly in the southern Qinling Mountains in China. The Qinba area is the climate transition area between the northern and southern areas as well as the transition area from Indica rice to Japonica rice, which is also the most suitable planting area for Indica rice in China. Germplasm resources form the basis of all breeding work; the analysis of genetic diversity and genetic structure is beneficial to mining excellent breeding materials and improving breeding efficiency. Particularly, in-depth genetic dissection of Indica rice germplasm resources have not been conducted. The population genetic structure is the non-random distribution of genes or genotypes in space and time, including genetic variations within populations and genetic differentiation between populations. Population structure analysis is essential to explore the biological adaptability, population formation process, evolutionary mechanism, protection, and development of biological resources. At the same time, populations with identical or similar genetic backgrounds is most suitable for genome-wide association studies (GWAS), therefore, the study of population genetic structure plays an important role in the field of biology, in which the selection of genetic markers is the top strategic priority, ranging from earlier morphological markers to more recent different types of DNA molecular markers [4][5][6][7]. The Indica rice genome has simple sequence repeats (SSRs) that span approximately 10-50 kb [8,9]. In the last few decades, SSR molecular markers have become important tools in the field of biology, particularly in terms of population structure, genetic mapping, and other related fields, SSR markers have also become the designated markers of the International Fingerprint Mapping Center [10][11][12]. These are employed in judicial identification, identification of new varieties of plants, such as rice, rape, and corn [13][14][15][16][17][18]. SSR markers are also used in DNA fingerprinting for breed protection [19]. However, SSR markers are scarce, show unbalanced distribution in the genome, have weak electrophoretic resolution, and are relatively time-consuming and labor-intensive to study, and thus it is difficult to construct high-density genetic maps. With the recent development of next-generation sequencing technology, most biological studies have rapidly improved, in particular, the use of single nucleotide polymorphisms (SNPs) based on genome-wide scans. With the release of extensive rice genome sequencing data, one SNP in every hundred base pairs or even every dozens of base pairs has been identified, indicating that there are numerous SNPs in the rice genome [6,20]. A small number of SNPs can be used to resolve many problems, so the sequencing technology was born based on simplified genome by restriction site-associated DNA (RAD) tags [21]. The frontrunner among these technologies is genotypingby-sequencing (GBS), which has recently gained attention because it utilizes methylation-sensitive restriction endonucleases (type II enzyme), thereby avoiding repetitive regions of the genome (methylated regions). GBS technology can rapidly identify high-density polymorphisms, especially SNPs [22]. In this study, two type II enzymes (NlaIII and MseI) were selected by simulated whole-genome enzyme digestion, which generated RAD tags for sequencing to obtain SNPs datasets, referred to as SNPs -NlaIII and SNPs -MseI , respectively. Simultaneously, a core set of SSR markers from NY/T1433-2014 [23] that originated in the Agricultural Standards of the People's Republic of China and 15 phenotypic traits were employed to explore gene flow and population genetic structure of 93 samples and to provide reference for future research studies using different genetic markers employed in related fields. long-term breeding selection of Indica rice in the Qinba region. The genetic structure of the 93 samples was simple and lacked rare alleles.

Phenotypic traits diversity and cluster analysis Phenotypic diversity analysis
Data of the 15 phenotypic traits of the 93 samples are summarized in Table 1. The basic statistical analysis and diversity of the 15 phenotypic traits based on phenotypic data is shown in Table 2. The coefficient of variation (CV) was 29.72% on average and ranged from 3.07% (brown rice rate) to 137.43% (chalkiness). The average diversity index (H e ) was 1.92 and ranged from 1.55 to 2.08, with brown rice rate and chalkiness having lower H e , indicating that there were relatively few phenotypes in these two traits. Overall, the Indica rice materials tested had relatively more phenotypes on these 15 traits; the distribution in each phenotype was uneven Tables 3 and 4.

Phenotypic traits clustering
Average Euclidean distance was 5.19, ranging from 0.90 (between W723 and W742) to 13.73 (between W699 and W733). Clustering result based on the 15 phenotypic traits was shown in Fig. 1, which demonstrated that the 92 samples were clustered together in addition to W669 and showed a single genetic basis for the population.

Polymorphism of SSR markers
A total of 378 bands was detected using 48 core SSRs primer pairs (Table 5). Among these, 336 polymorphic bands were detected. The average number of polymorphic fragments was 7, ranging from 1 to 14. The highest number (14) of polymorphic bands was detected by RM278 while RM311 is the least bands. The average value of PPB (Percentage of polymorphic bands) was 88.87%, ranging from 50% to 100%. The average value of PIC (Polymorphism information content) was 0.77, ranging from 0.19 to 0.88. Data showed that core SSR in rice can produce rich bands and high polymorphic rate.

Clustering based on SSR
PC, in which the first three PC (eigenvalue) to select and their cumulative contribution of variance accounted for 15.76%, and the unweighted pair-group method with arithmetic means (UPGMA) were performed, which demonstrated that the 93 genotypes could be divided into 2 subgroups (Fig. 2).

Bayesian clustering based on SSR markers
A total of 378 SSR bands was used to elucidate the population structure of the entire pool of 93 rice germplasms. The best K was K = 2, suggesting that the 93 rice germplasms were best divided into two subgroups (Fig. 3).

SNPs marker analysis
A total of 39,872 SNPs -NlaIII and 35,547 SNPs -MseI passed the minor allele frequency (MAF) lower limit of 0.05 using NlaIII and MseI digestion, respectively. Merged data of SNPs -NlaIII and SNPs -MseI , with a total of 72,824 SNPs including 67,621 SNPs that aligned to specific chromosomes and 5,023 SNPs unlocalized, were then obtained.

Linkage disequilibrium (LD) and haplotype analysis
From the total of 6,288,753 loci (93 samples × 67,621SNPs), 326,873 (5.198%) were heterozygous. The 67,621 SNPs were unevenly distributed on the 12 chromosomes (Fig. 4a); chromosome 1 contained the largest number of makers (8,425), while chromosome 8 included the least (3,953). LD, as represented by inter-loci R 2 values, was calculated for the 84,255 SNP pairs. R 2 value had a minimum of 0.2 and an average of 0.73. 46,322 SNP pairs (54.98%) had R 2 values higher than 0.8, while 7,841 pairs (9.31%) were in complete LD (R 2 =1). The 12 chromosomes yielded a total of 6,568 predicted haplotypes (Fig. 4b), with chromosome 1 possessing the most haplotypes (776) and chromosome 10 possessing the least (349). The largest haplotype was composed of 95 SNPs. The longest haplotype spanned over 200.0 kb; the average haplotype length was 33.71kb.

AMOVA and gene flow
The average MAF of the 93 samples was 0.21. Tajima's D value was 1.66, which suggests low levels of both low-and high-frequency polymorphisms, indicating a decrease in population size and/or balancing selection that resulted in more haplotypes and lacked rare alleles in this population. Analysis of molecular variance (AMOVA) showed that the genetic variation was 98% within the population and 2% between populations, which indicated the existence of slight genetic variation among 93 samples. The genetic differentiation coefficient (F ST ) between the two populations was 0.61, and gene flow (N m ) was 0.16. Further investigation showed that the gene flow of selfing crops was the smallest, and that of annual herbaceous plants was the lowest. If N m > 1, which indicates that the level of gene flow between populations is high, then genetic differentiation among populations is small; if N m > 4, then gene communication between populations is more adequate and genetic differentiation is smaller; and N m < 1 indicates that population differentiation may have occurred due to genetic drift. The gene flow was 0.16, which indicates Note: M, R and S in the table refer to the maintainer line and restorer line of rice CMS Lines, and special rice. Phenotypic trait number (1 to 15) in first row correspond from left to the right to The period from seeding to heading (d); Plant heights (cm); Leaf length (cm); Leaf width (cm); Average single plant valid spike number; Spike lengths (cm); Kernel numbers per spike; Grain numbe; 1000-seed weights (g); Brown rice rate (100%); Milled rice rate (100%); Head rice rate (100%); Chalky rice rate; Chalkiness; Length-width ratio that the gene flow among populations in the Qinba region is lower, but nearly 2.5-fold higher than that of conventional inbred plants, which may result in longterm artificial selection, leading to reduced genetic differentiation.

PC clustering
Principal component analysis was performed to select the first three PCs (based on eigenvalue). Their cumulative contribution of variance accounted for 40.69%, Table 2 Basic statistical analysis and diversity of the 15 phenotypic traits Most phenotypic traits were correlated or significantly correlated. The most significant correlation was between kernel numbers per spike and grain number, followed by that between chalky rice rate and chalkiness. However, the correlation between chalky rice rate and length-width ratio was the least significant, followed by that between leaf width and average single plant valid spike number (  Table 3 Pearson correlation coefficient analysis of the 15 phenotypic traits Note: Asterisk indicates significant difference between phenotypic traits using two-tailed t-tests. *P<0.05; **P<0.01 Principal components were extracted based on the criterion that the eigenvalue was greater than 1.0. The eigenvalues of the first four PCs in 15 phenotypic traits were greater than 1.0, and together accounted for 70.693% of the phenotypic variation (  39.76% and 40.10% for SNPs -NlaIII , SNPs -MseI , merged data of SNPs -NlaIII and SNPs -MseI , respectively, which demonstrated that the 93 genotypes could be clustered into two subgroups by the first three PCs (Fig. 5), with W366 and W367 being always separated from other samples.

UPGMA clustering
The unweighted pair-group method with arithmetic means (UPGMA) algorithm was performed and demonstrated that the 93 genotypes could be divided into 2 subgroups (Fig. 6), which was consistent with the PC results. Group I included 1 to 3 samples, while group II contained 92 to 90 samples. The average genetic distance was 0.29, ranging from 0.02 to 0.55 based on merged SNPs -NlaIII and SNPs -MseI data. The two most closely related materials were W710 and W711, and the two most furthest materials were W366 and W740.

Bayesian clustering
Seventy-two thousand eight hundred twenty-four SNPs (MAF <5%) were used to assess the population structure of the entire pool of 93 samples. Delta K reached a maximum value at K=2, suggesting that the 93 samples were divided into two subgroups (consisting of 70 and 23 samples) (Fig. 7). In the population structure analysis, the results from K = 2 to K = 5 revealed the occurrence of gene introgression between groupІ and groupII, accounting for approximately 76.34% of the observed variations (calculated with K = 2).    The analysis performed based on phenotypic traits, SSR and SNPs data using PCA, UPGMA and Bayesian clustering led to inconsistent results, the population structure is relatively simple, the matrix delamination is not distinctive.

Clustering of different category materials
Population genetic information of different category samples, including 57 restoring lines, 19 maintainer lines and 17 special rice lines was analyzed (Table 6) and clustered (Fig 8a, b, c, respectively). Results showed that the genetic basis of the restorer line was more abundant than that of the maintainer line, and that the genetic basis of the special rice was wider than that of the conventional rice.

Correlation analysis of genetic distance matrices based on 3 types of genetic markers
All cluster analyses were based on the genetic distance or genetic similarity coefficient generated by genetic markers between samples; in the present study, the coefficients of    (Fig. 9). These results may be due to the use of different number of markers.

Discussion
Phenotype is the result of the interaction of genotype with environment. A given genotype can be expressed as different phenotypes in different environments; this is known as the plant inherent phenotypic plasticity, which is different from the genotype. In the study of germplasm resources with more samples, economical and effective method with the use of phenotype data to study population genetic structure and genetic diversity is also a very important at this early stage. The selected 15 traits belong to quantitative traits and are greatly affected by environment; hence it is not recommended to use them for population genetic structure analysis. In recent decades, SSR markers, which represent second-generation DNA molecular markers, have been widely used for plant population genetic analysis, phylogenetic reconstruction, and quantitative trait mapping. All kinds of DNA markers are different and results generated by different DNA markers reflect different polymorphic region in the genome and can reveal various information contained in the genome. Theoretically, the more markers used, the more accurate results will be. SSR markers are mostly distributed in centromeres, telomeres, introns, and 3' untranslated regions (UTR). Most of these markers are non-functional genetic markers and do not affect the application of SSR marker in clustering analysis. While the population genetic variation explained by SNPs was larger than that explained by SSRs (for example, in PC analysis), the accumulative contribution rate of the first three major factors analyzed by SSR was only 15.76%, far less than that using SNPs data (40.10%), indicating that the more DNA polymorphism, the more accurate the population variation can be explained. For association analysis between markers and traits, greater number of polymorphic sites is associated with higher mapping resolution. A natural population often contains multiple sub-populations, which could result in high degree of LD within the tested population and lead to pseudo-association between markers and traits. Therefore, LD and haplotype studies are necessary before carrying out association analysis. The core 48 pairs of SSR markers as well 72,824 SNPs had rich bands and high polymorphism in Indica rice genome; clustering result of SSR was concordant with that of SNPs, but different from phenotypic traits clustering. All analyses of the population genetic structure were based on the estimation of genetic distance or genetic similarity coefficient matrix between samples. Analyses were conducted using three methods: PC, UPGMA, and Bayesian clustering. Bayesian algorithm is more practical than UPGMA and PC analysis no matter which genetic marker is used given the prior pedigree knowledge of the 93 samples. At the same time, the size of gene flow of each sample can be seen from the population genetic structure graph based on Bayesian algorithm.
Through the analysis of different types of materials (57 restorer lines, 19 sterile lines, and 17 special rice), the results showed that the genetic basis of the restorer lines was richer than that of the maintainer lines, which Correlation between the genetic distance matrices generated using different genetic markers was consistent with the conclusion of Ying Jiezheng et al. [24]. The main reason may be that most CMS (cytoplasmic male sterile) lines currently used in production are related to cultivars such as Zhenshan 97B, II-32B, Zhong9a and Gang46a and may be derived from Aizazhan and Aijiaonante, which originated from dwarf rice varieties. At present, the restorer lines used in combination production originate from the Yangtze River Basin of China, Sichuan, Southeast Asia, South Korea, etc, and were created by crossing Indica and Japonica rice. Special rice has an abundant genetic basis compared to other rice germplasm resources and has high breeding potential.

Conclusions
Higher number of genetic markers is related to higher explained population variation, especially functional DNA markers. The above showed that it is difficult to make certain the genetic nature of rice germ-resources using phenotype traits clustering. Clustering results based on different genetic markers showed that the genetic basis of 93 samples was single. Average genetic distance was 0.29 based on 72,824 SNPs of 93 samples, which may be due to many reasons, such as the wide exchange of variety resources among breeding units in the process of breeding, and similar breeding goals. Genetic effects in populations depend on the opportunity distribution of MAFs across the genome-wide, and different populations have different MAF values. Although the gene flow in the population composed of 93 samples was relatively large, the average MAF of the population was only 0.21, indicating the genetic structure of 93 samples is simple and lacked rare alleles. Though the amount of colored rice only take up a small proportion of rice resource in this study, it arose an extensive attention all over the world, due to its characteristics which include special nutrition, health care and artificial utilization. Measures to improve the genetic diversity of rice cultivars in the Qinba area are important in the future.

Plant materials
A total of 93 samples were collected from the Shaanxi Rice Research Institute (Hanzhong city, China), comprising 57 restoring lines, 19 maintainer lines, and 17 special rice (Special rice refers to rice with special genetic traits and uses such as colored rice genotypes including black rice, purple rice, red rice, green rice and yellow rice and aromatic rice germplasm, which only research colored rice in this study.), which were representative of the diversity of Oryza sativa Xian group present in the Qinba area in China.

Phenotyping
Six plants in the middle of each plot were selected to investigate the values of agronomic, economic and quality traits according to "Recording items, methods and standards of national rice variety test and observation" as well as "National Standard of GBT 17891-1999 high quality paddy". The 15 selected phenotypic traits included sowing date, plant height, leaf length, leaf width, effective number of panicles per plant, panicle length, total number of grains per panicle, number of filled grains per panicle, 1000-grain weight, browning rate, milled rice rate, head milled rice rate, chalky grain rate, degree of chalkiness, and length/width ratio; the averages of the threeyear data were used as the phenotypic data.

Phenotypic traits statistical analysis
The mean value (x), standard deviation (δ), and coefficient of variation (CV) were computed. Shannon-Weiner index (H') was calculated according to the following equation: H'=-∑P i lnP i , where P i is the proportion of samples ranked at i th grade for a given phenotypic trait among all samples (all of the phenotypic traits were divided into 10 grades by assigning values less than x-2δ as 1 th grade and those greater than x + 2δ as 10 th grade, with inter-grade difference of 0.5δ for the remaining grades). All of phenotypic trait data were the standardized using Z-scores, and hierarchical cluster analysis was performed using between-groups linkage method based on Euclidean distance. The above analysis was carried out with the IBM SPSS statistics 22.0 software; MEGA7.0 software was used for editing and visualizing cluster results.

SSR genotyping
The genomic DNA of 93 samples was extracted from fresh leaves using the SDS technique and detected with 0.8% agarose gel electrophoresis. The 48 SSR primers were synthesized by Beijing Aoke Biotechnology Co., Ltd. (Beijing, China). PCR were carried out in a 10 μL volume containing 1 μL DNA template, 2 μL (10 μM) of forward and reverse primers (1μL each), 5μL 2×Taq Master Mix, and 2 μL RNase-free water. Reactions were programed as follows: initial denaturation at 94.0°C for 5 minutes, denaturation at 94.0°C for 1 minute, annealing at 50-60.0°C for 1 minute, and extension at 72.0°C for 1 minute, for a total of 35 cycles. Electrophoresis was performed using 8% non-denaturing polyacrylamide gel under 95V voltage; bands were visualized via silver staining. Following electrophoresis, each amplification band corresponded to a primer hybridization locus and was considered as an effective molecular marker. Each polymorphic band detected by a same given primer represented an allelic mutation. In order to generate molecular data matrices, clear bands for each fragment were scored in every accession for each primer pair and recorded as 1(presence of a fragment), 0(absence of a fragment), and 9(complete absence of band).

SSR marker efficiency analysis
The value of the polymorphism information content (PIC) was calculated using the PIC_Calc 0.6 program (http:// www. bio-soft. net/ dna/ pic. htm). The level of polymorphism of each marker was assessed by the polymorphism information content, which measures the extent of genetic variation: PIC values smaller than 0.25 indicates low levels of polymorphism associated to a locus, PIC values between 0.25 and 0.5 imply moderate levels of polymorphism, while PIC values greater than 0.5 indicate high levels of polymorphism [25].

SNPs genotyping
The genomic DNA of 93 samples was digested using the NlaIII and MseI enzymes. GBS was performed using the Illumina Hiseq 2000 platform of Novo Gene Bioinformatics Technology Co.,Ltd (Beijing, China). The SNPs data obtained with NlaIII and MseI digestion were recorded as SNPs -NlaIII , SNPs -MseI , respectively. Polymorphism filtering of SNPs was done with dp., Miss and MAF of 2, 0.3 and 0.05, respectively, followed by annotation based on the reference genome (ftp:// ftp. ensem blgen omes. org/ pub/ plants/ relea se-37/ fasta/ oryza_ indica/ dna/).

LD and haplotype construction
Genotype data were then used to calculate LD between SNPs and to construct haplotypes using the EM algorithm implemented in PLINK1.07 (https:// www. cog-genom ics. org/ plink2). The commands "--r2" and "--blocks" were used to calculate LD and assign SNPs to their respective haplotypes by calculating inter-maker LD within a 200kb window, respectively. Figures were constructed using the Origin8 platform (http:// www. origi nlab. com/).

AMOVA and gene flow
A total of 72,824 SNPs were employed to analyze molecular variance (AMOVA) and gene flow. The components of variance attributable to different varieties and breeding lines were estimated from the genetic distance matrix using the Tajima & Nei method, as specified in the AMOVA procedure in ARLEQUIN 3.1 [26]. A nonparametric permutation procedure with 9999 permutations was used to test the significance of variance components associated with the different possible levels of genetic structure in this study. The pairwise Fst values, a value of F statistic analogs computed from AMOVA, were used to compare genetic distances between any two groups.

Bayesian clustering
STRU CTU RE 2.3.4 (http:// taylo r0. biolo gy. ucla. edu/ struc tureH arves teroy base. org/ tools. php), which applies a Bayesian clustering algorithm, was used to simulate population genetic structure based on SSR and SNPs data, respectively. Using a membership probability threshold of 0.60, population K values from 1 to 5 were simulated with 5 iterations for each K using 10,000 burn-in periods followed by 10,0000 Markov Chain Monte Carlo iterations in order to obtain an estimate of the most probable number of populations. Delta K was plotted against K values; the best number of clusters was determined following the method proposed by Evanno et al [28], and obtained via the Structure Harvester platform (http:// taylo r0. biolo gy. ucla. edu/ struc tureH arves ter/) [29].